

Measuring Systematic Generalization in Neural Proof Generation with Transformers

Neural Information Processing Systems

We are interested in understanding how well Transformer language models (TLMs) can perform reasoning tasks when trained on knowledge encoded in the form of natural language. We investigate their systematic generalization abilities on a logical reasoning task in natural language, which involves reasoning over relationships between entities grounded in first-order logical proofs. Specifically, we perform soft theorem-proving by leveraging TLMs to generate natural language proofs. We test the generated proofs for logical consistency, along with the accuracy of the final inference. We observe length-generalization issues when the models are evaluated on sequences longer than those seen during training. However, we observe that TLMs improve their generalization performance after being exposed to longer, exhaustive proofs.
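
The abstract mentions testing generated proofs for logical consistency. As a rough, hedged sketch of what such a check could look like (the composition rules, triple format, and function name below are illustrative assumptions, not the paper's actual implementation):

```python
# Hedged sketch: one plausible way to check a generated proof for logical
# consistency. The composition table, fact format, and proof-step format
# are illustrative assumptions, not the dataset's actual schema.

# Relation-composition rules of the form:
# (relation of A to B, relation of B to C) -> relation of A to C
COMPOSE = {
    ("sister", "mother"): "aunt",      # hypothetical rule entries
    ("brother", "father"): "uncle",
}

def check_proof(facts, proof_steps, answer):
    """facts: iterable of (head, relation, tail) triples given in the story.
    proof_steps: list of (premise1, premise2, conclusion) triples of triples.
    answer: the (head, relation, tail) triple the query asks about."""
    known = set(facts)
    for premise1, premise2, conclusion in proof_steps:
        # Every premise must already be known (given or previously derived).
        if premise1 not in known or premise2 not in known:
            return False
        (h, r1, m1), (m2, r2, t), (h2, r_out, t2) = premise1, premise2, conclusion
        # The premises must chain through a shared middle entity, and the
        # conclusion must match the composed relation.
        if m1 != m2 or h != h2 or t != t2:
            return False
        if COMPOSE.get((r1, r2)) != r_out:
            return False
        known.add(conclusion)
    # The final answer must have been derived by the proof.
    return answer in known
```

Under these assumptions, a generated proof counts as consistent only if every step chains already-known facts through a valid composition rule and the queried answer is among the derived facts.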


Review for NeurIPS paper: Measuring Systematic Generalization in Neural Proof Generation with Transformers

Neural Information Processing Systems

This paper evaluates a trained-from-scratch Transformer language model on an artificial simple-theorem-proving task in a way that helps to highlight and clarify some limitations of this commonly-used architecture. Reviewers found some points in the motivation and in the discussion of results potentially misleading, especially regarding the connection between this work and natural language, but ultimately formed a consensus that the primary claims of the paper are sound and significant, and that the remaining presentational issues don't undermine them.


Review for NeurIPS paper: Measuring Systematic Generalization in Neural Proof Generation with Transformers

Neural Information Processing Systems

Summary and Contributions: This paper evaluates how well Transformer language models can generate natural language expressions corresponding to first-order logical proofs, and their answers. Given a dataset of facts (tuples like entity1-relation1-entity2, entity2-relation2-entity3) and a query (entity1-?-entity3), the language model is trained on a sentence representing the facts, the query, a proof, and the answer. The proof is a chain of implications (for example, one step is "since entity1 is in relation1 with entity2 and entity2 is in relation2 with entity3, then entity1 is in relation2 with entity3"). The answer is the missing relation, such as relation2. The model can then be tested by presenting only the prefix of the expressions corresponding to the facts and the query (and perhaps the proof), and predicting the answer. The paper evaluates the ability of Transformer language models to generalize in several settings, determined by the number of relations.
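
To make the serialization described above concrete, here is a minimal, hedged sketch of how one such example might be assembled; the textual templates, entity and relation names, and the helper function are hypothetical and only illustrate the facts-query-proof-answer layout, not the paper's actual preprocessing.

```python
# Hedged sketch of how one training example might be serialized, following
# the format described in the review (facts, query, proof, answer).

facts = [("Alice", "sister", "Bob"), ("Bob", "father", "Carol")]
query = ("Alice", "?", "Carol")
proof = [
    "since Alice is the sister of Bob and Bob is the father of Carol, "
    "then Alice is the aunt of Carol"
]
answer = "aunt"

def fact_to_text(head, relation, tail):
    return f"{head} is the {relation} of {tail}"

# Full sequence the language model is trained on (next-token prediction).
train_sequence = (
    "facts: " + ". ".join(fact_to_text(*f) for f in facts) + ". "
    + f"query: what is {query[0]} to {query[2]}? "
    + "proof: " + " ".join(proof) + " "
    + f"answer: {answer}"
)

# At test time, only the prefix covering the facts and the query (and
# perhaps the proof) is given, and the model generates the rest, ending
# in the answer.
eval_prefix = train_sequence[: train_sequence.index("proof:")]
print(train_sequence)
print(eval_prefix)
```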


ORCHARD: A Benchmark For Measuring Systematic Generalization of Multi-Hierarchical Reasoning

Pung, Bill Tuck Weng, Chan, Alvin

arXiv.org Artificial Intelligence

The ability to reason with multiple hierarchical structures is an attractive and desirable property of sequential inductive biases for natural language processing. Do state-of-the-art Transformer and LSTM architectures implicitly encode these biases? To answer this, we propose ORCHARD, a diagnostic dataset for systematically evaluating hierarchical reasoning in state-of-the-art neural sequence models. While there have been prior evaluation frameworks such as ListOps or Logical Inference, our work presents a novel and more natural setting in which models learn to reason with multiple explicit hierarchical structures instead of only one, i.e., the task requires long-term sequence memorization and relational reasoning while reasoning over hierarchical structure. Consequently, backed by a set of rigorous experiments, we show that (1) Transformer and LSTM models surprisingly fail to generalize systematically, and (2) as references between hierarchies increase, the Transformer performs no better than random.